4 Main Analysis

Note: the data is zipped in the data folder when downloaded and must be unzipped in order to read

Exloratory Data Analysis

This file explores the cleaned version of the dataset provided by the ‘part4_cleaning.rmd’ file.

To begin, we read in the cleaned data as well as some additional data we will compare against later on. For this analysis, we binned the shapes into five categoriexs.

ufo <- as_tibble(read_csv("data/ufo_clean_final.csv"))
ufo <- ufo %>% select(-X1, -X1_1)

state_pops <- as_tibble(read_csv("data/StatePops.csv"))
excer <- as_tibble(read.csv("data/Excercise.csv"))
colnames(excer)[2] <- "Excercise"

trump_state <- as_tibble(read_csv("data/TrumpState.csv"))

obese <- as_tibble(read_csv("data/Obesity.csv"))

cbPalette <- c("#56B4E9", "#F0E442", "#E69F00", "#009E73", "#CC79A7", "#D55E00", "#999999", "#0072B2")

ufo <- ufo %>%
  mutate(Shape2 = fct_relevel(Shape2, "Circle", "Light", "Triangle", "Rectangle", "Other"))

NA NA NA NA… Missing Data Analysis

As seen in the data quality report, nearly all sightings reported contain complete data, and the observations missing some data are mostly just missing a single field value.

visna(ufo[, c('Duration','Shape','State','Country','Time','City','Desc','Summary' )], sort = "b")

We considered the possibility that international reports may be more likely to have missing data. However, we found that this is not the case. Missing data is consistent across reporting countries.

percent_missing <- ufo %>% group_by(`In_USA`) %>%
summarize(num = n(), num_na = sum(is.na(`Shape`))) %>%
mutate(percent_na = round(num_na/num, 2)) %>%
arrange(-percent_na)
percent_missing
## # A tibble: 2 x 4
##   In_USA    num num_na percent_na
##    <lgl>  <int>  <int>      <dbl>
## 1  FALSE  13268   1364        0.1
## 2   TRUE 101570  10253        0.1

We also wondered whether any component of the sighting itself would impact what data was omitted when filing the report. Here, we consider whether the ‘Shape’ of the object has any impact on missing data, and found that there is little variance between shapes.

percent_missing <- ufo %>% group_by(`Shape1`) %>%
summarize(num = n(), num_na = sum(is.na(`Duration`))) %>%
mutate(percent_na = round(num_na/num, 2)) %>%
arrange(-percent_na)

ggplot(percent_missing, aes(x=Shape1, y=percent_na)) +
  geom_bar(stat='sum',fill='dodgerblue') +
  xlab('Shape') + ylab('% Missing') +
  theme_light() + ggtitle('Percent of Missing Values by Shape') + theme(legend.position = 'none') + scale_y_continuous(labels = percent)

Review Sightings by State

Here we split the duration into discrete bins to make future plotting easier

USA_ufo <- ufo %>% filter(In_USA == TRUE & is.na(Shape2) == FALSE)
bin_duration <- function(duration){
  
  if(is.na(duration) == TRUE){
    return(NA)
  }
  
  if(duration < 60){
    return("< 1")
  }
  else if(duration < 300){
    return("1 - 5")
    
  }
  else if(duration < 600){
    
    return("5 - 10")
  }
  
  else if(duration < 1800){
    return("10 - 30")
  }
  
  else if (duration < 3600){
    return("30 - 60")
  }
  else{
    return("60+")
  }
  return(NA)
  
}

Unidentified Flying Graphs

ufo <- ufo %>% mutate(`Bin_Dur` = lapply(ufo$Duration, bin_duration))
ufo$Bin_Dur <- unlist(ufo$Bin_Dur)
ufo <- ufo %>%
  mutate(Bin_Dur = fct_relevel(Bin_Dur, "< 1", "1 - 5", "5 - 10", "10 - 30", "30 - 60", "60+"))

As we would expect, the number of reported sightings increases with the availability of the internet, especially after the creation of the NUFORC website in 1998.

ggplot(ufo, aes(x = year)) + geom_bar(fill='dodgerblue')  + scale_y_continuous(name = "Sightings (thousands)", breaks = c(2000,4000,6000,8000), labels = c(2,4,6,8))+
  ggtitle("Sighting Frequency (Full Dataset Timescale)") + theme_light()

ufo1950 <- ufo %>% filter(year>=1950)
ggplot(ufo1950, aes(x = year)) + geom_bar(fill='dodgerblue') +
  ggtitle("Sighting Frequency (1950-2018)") + theme_light() + scale_y_continuous(name = "Sightings (thousands)", breaks = c(2000,4000,6000,8000), labels = c(2,4,6,8))

As seen below, the most frequent shape of observations in the data are circular shaped observations with light as a close second. This is consistent with the long portrayed image of a ‘UFO’ in pop culture. Many science fiction media portay UFOs as saucer type objects that have lights, often flashing. The third most common shape is triangluar. This is consistent with a more recent phenomenon in the UFO sighting community known as the Black Triangle theory (https://www.space.com/302-silent-running-black-triangle-sightings-rise.html)

ggplot(ufo, aes(x = Shape2)) + geom_bar(fill='coral2') +
  ggtitle("Sighting Frequency by Observed Shape") +
  xlab("Shape") + scale_y_continuous(name='Sightings (thousands)', labels = c(10,20,30),breaks = c(10000,20000,30000)) + theme_light()

Most of the sightings in the dataset are fairly short (between seconds and 5 minutes); however, there are a significant amount of sightings of all durations.

ggplot(ufo, aes(x = Bin_Dur)) + geom_bar(fill='darkcyan') + labs(x = "Duration in Minutes", y = "Number of Sightings") +
  ggtitle("Sighting Frequency by Duration") + scale_y_continuous(name='Sightings (thousands)', labels = c(10,20,30),breaks = c(10000,20000,30000)) + theme_light()

Zooming in to the sightings less than one hour, we can make a few observations.

  • Duration is heavily right skewed, with most sightings last less than 5 minutes
  • There is evidence of rounding in reporting duration with nearly all values greater than 10 minutes, rounded to 5 minute intervals. This trend is even more evident when looking at the 30 min, 45 min, and 60 min intervals.
ufodurclip <- ufo %>% filter(is.na(Duration) == FALSE & Duration <= 3600)
ggplot(ufodurclip, aes(x = Duration)) + geom_histogram(, color="black", bins=60,fill="springgreen3") +
  ggtitle("Duration Histogram, Limited to 1 Hour") + scale_y_continuous(name='Sightings (thousands)', labels = c(10,20,30),breaks = c(10000,20000,30000)) + scale_x_continuous(name = "Duration (minutes)", breaks = seq(0,3600,300), labels = seq(0,60,5)) + theme_light()

Duration of observation is independent of the shape. That is, UFOs of circular shape are no more or less likely to be seen for a longer amount of time than, say, triangular UFOs.

temp <- ufo %>% filter(year >= 1995 & year != 2018 & is.na(Bin_Dur) == FALSE) %>% group_by(Bin_Dur) %>% summarize(durfreq = n())
durshape <- ufo %>% filter(year >= 1995 & year != 2018 & is.na(Bin_Dur) == FALSE) %>% group_by(Bin_Dur, Shape2) %>% summarize(durshapefreq = n())
durshape <- merge(durshape, temp, by="Bin_Dur")
durshape$prop <- durshape$durshapefreq/durshape$durfreq

ggplot(durshape, aes(x = Shape2, y = prop, fill = Shape2)) +
  geom_bar(stat = "identity") + facet_wrap(~Bin_Dur) +
  ggtitle("Shape Proportion Across Durations") +
  xlab("Duration in Minutes") + scale_fill_manual(values=cbPalette, name = "Shape") + scale_y_continuous(labels = percent) + ylab("Proportion of Sightings")+ theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme_light()

As expected, the majority of UFO sighting happen between 6pm and midnight. Late enough that it’s dark, but early enough that people haven’t gone to sleep yet. We choose to use 24 bins (one for each hour of the day) to smooth the effects of rounding in the trend. We can also notice a bump at 12noon, indicating typical lunch break when people are more inclined to be outside.

ggplot(ufo, aes(as.POSIXct(Time))) +
  geom_histogram(bins=24, color="black", fill="springgreen3") +
  ggtitle("UFO Sighting Frequency Over Time") +
  labs(y = "Frequency") +
  # scale_x_datetime(breaks = date_breaks("1 hour"))
  scale_x_datetime(breaks = date_breaks("2 hour"), labels = date_format("%H:%M")) + 
  xlab("Time of Day") + 
  scale_fill_viridis(name = "Frequency") + theme_light()+ scale_y_continuous(name='Sightings (thousands)', labels = c(2,4,6,8,10,12,14),breaks = c(2000,4000,6000,8000,10000,12000,14000))

We can also observe that there are significantly more sightings during the warmer months of the year (noting that all of the reports are in the northern hemisphere). We suspect that these are the months where people are more inclined to spend time outside and therefore have more opportunity to observe the UFOs.

Now lets focus on the ‘description’ field. This field is rightfully free form, and allows us to do some text-based analysis. First, we will look at how descriptive the reports are and if there are any other features that entice reporters to provide longer descriptions. Later (see part67.rmd) we consider the word frequency and two-word affinity within comments.

We can see that descriptions tend to be 200 words or less with a mean around 100 words, but there is a significant percentage of descriptions that exceed 400 words.

ufoShortDescription <- filter(ufo, DescriptionLength < 1000)

ggplot(ufoShortDescription, aes(DescriptionLength)) +
  geom_histogram(bins=70, color="black", fill="firebrick3") +
  ggtitle("UFO Sighting Description Word Count by time of Day and Shape") +
  labs(y = "Frequency", x = "Word Count") + scale_fill_viridis(name = "Frequency") + scale_y_continuous(name = "Sightings (thousands)", breaks = seq(2000,8000,2000), labels = seq(2,8,2)) + theme_light() + scale_x_continuous(name = "Word count", breaks = seq(0,1000,100), labels = seq(0,1000,100))

We considered whether observing certain shapes would encourage a longer description. When looking for this trend, we actually uncovered a separate, but equally interesting one. During the daylight hour observations, there is a clear drop off of “light” observations. This seems obvious due to lighted objects being more difficult to spot during daytime. This trend can be seen as the color gradient changes over time in the scatter plot below. At night, as we saw earlier, there is an increased frequency of reports, and the distribution of shapes is more unifrom.

ggplot(ufoShortDescription, aes(x=as.POSIXct(Time), y=DescriptionLength)) +
  geom_point(alpha=0.2, aes(color = Shape2)) +
  ggtitle("Word Count by time of Day and Shape") +
  labs(y = "Word Count", x = "Time")  +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) + 
  scale_color_manual(values=cbPalette, name = "Shape") + 
  scale_x_datetime(breaks = date_breaks("2 hour"), labels = date_format("%H:%M")) + theme_fivethirtyeight()

To reinforce the above trend, when faceting the same information, we can see that circles are more prominent during the day, and light sightings are more frequent at night.

temp8 <- ufoShortDescription %>% filter(year >= 1995 & year != 2018 & is.na(Shape2) == FALSE) %>% group_by(Time, Shape2) %>% summarize(freq = n())
ggplot(temp8, aes(Time, freq, color = Shape2)) + geom_line() + facet_wrap(~Shape2) + 
    ggtitle("UFO Sighting Shapes Over Time of Day") +
    labs (x = "Time", y = "Sightings") +  scale_color_manual(values=cbPalette, name = "Shape") + theme_fivethirtyeight()

Geospatial Analysis

Before pursuing any geography based analysis, we must control for population density, as most of the most popular states (California) have large numbers of sightings due to it’s size. This is seen in the graph below showing absolute number of sightings per state.

ufostate <- ufo %>% filter(is.na(State) == FALSE & In_USA == TRUE)


ggplot(ufostate, aes(x = State)) + geom_bar() +
  ggtitle("Sightings by State (no population correction)") + theme_light() + theme(axis.text.x=element_text(angle=60, hjust = 1)) + ylab('Sightings')

Now that we have corrected for population density, lets consider the states with the highest per-capita sightings reported - WA, VT, MT, AK, OR, and ME. We can notice that 4 of these states are on the coasts. Additionally, there seems to be a west coast bias to frequency of observations. We rationalize this by noting that the headquarters of NUFORC is in Washington state. We also observe that these states are fairly rural. In the next section, we will see that these states favor outdoor activities, unlike other rural states such as Alabama, Texas, and Georgia, where people tend to spend their time indoors.

USA_ufo_merged <- merge(USA_ufo, state_pops, by = "State")
USA_ufo_summary <- USA_ufo_merged %>% filter(year >= 1995) %>% group_by(State, Shape2, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population)

temp5 <- USA_ufo_merged %>% filter(year >= 1995) %>% group_by(State, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population)
temp5$state <- temp5$State
statebins(temp5, value_col = "Prop", text_color =  "black", font_size = 3, legend_title = "Sightings Per 10,000 Residents", legend_position =  "bottom") + labs(title = "The Coasts are More Supernatural")

ggplot(USA_ufo_summary, aes(x = reorder(State, -Prop), y = Prop)) + geom_bar(stat = "identity", fill = 'navy') + labs(y = "Sightings per 10,000 Residents", x = "State") +
  ggtitle("Sightings by State, Population Corrected") + theme_light() + theme(axis.text.x=element_text(angle=90, hjust = 1)) 

Extraterrestrial Correlations

We use three different proxies to closely explore the correlation between amount of time spent outside with frequency of UFO sightings:

As there is no measure for how much time people spend outside, we believe that these 3 proxies are likely correlated with outdoor activity.

All three of these graphs show a clear correlation. Though there are many individual states that do not follow this pattern, overall the trend is clear - more exercise is correlated with more UFO sightings and a lower percentage of obese adults is correlated with more UFO sightings.

USA_ufo_summary2 <- USA_ufo_merged %>% filter(year >= 2009) %>% group_by(State, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population) 
USA_ufo_summary2 <- merge(USA_ufo_summary2, excer, by = "State")
ggplot(USA_ufo_summary2, aes(x = ATUS, y = Prop)) + geom_point(stat = "identity") + labs(x = "Minutes a day spent doing Sports, Excercise, Recreation", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary2$State)+ geom_smooth(method='lm',formula=y~x) + theme_light() + scale_x_continuous(breaks = seq(0,.7,.1),labels = seq(0,42,6)) +
ggtitle("Sightings by Exercise (2009 - Present)")

ggplot(USA_ufo_summary2, aes(x = Excercise, y = Prop)) + geom_point(stat = "identity") + labs(x = "Percentage of population engaged in sports and exercise on an average day", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary2$State) + geom_smooth(method='lm',formula=y~x) + theme_light() + ggtitle("Sightings by Population engaged in Sports/Exercise (2009 - Present)")

USA_ufo_summary2 <- merge(USA_ufo_summary2, obese, by = "State")
ggplot(USA_ufo_summary2, aes(x = Obesity, y = Prop)) + geom_point(stat = "identity") + labs(x = "Percentage of Obese Adults", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary2$State) + geom_smooth(method='lm',formula=y~x) + theme_light() + ggtitle("Sightings Correlated with Obesity (2009 - Present)")

Given this evidence, we have confidence that states where residents are more likely to spend time outside are also more likely to observe UFOs. This seemingly trivial result confirms our common sense perception that you have to spend time looking at the sky to see something unidentifiable.

We wanted to see if other state-by-state features were correlated with sightings per 10,000 residents, including political affiliation. Filtering for obervations in the last four years, we can see a slight negative trend with Donald Trump’s 2016 election margin by state. Though this is open for interpretation, in our opinion, we find it likely that Trump voters are much likely to spend time outside/exercising.

USA_ufo_summary3 <- USA_ufo_merged %>% filter(year >= 2014 & State != "DC") %>% group_by(State, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population)
USA_ufo_summary3 <- merge(USA_ufo_summary3, trump_state, by = "State")
ggplot(USA_ufo_summary3, aes(x = `Trump Margin`, y = Prop)) + geom_point(stat = "identity") + labs(x = "Percentage of Popular Vote for Donald Trump in 2016", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary3$State) + geom_smooth(method='lm',formula=y~x) + theme_light() + ggtitle("Sightings correlated with Political Leaning (2014 - Present)")